Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
This course is taught in a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and, when required, datasets to import into Python. This learning approach lets you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).
We'll take a blank-slate approach here to Python and assume that you know essentially nothing about programming. Over the course, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to the third lecture in a series of six. Today you will dive into more detailed data structures and the packages that work with them, and build up towards a more standard data science structure - the DataFrame.
At the end of this lecture we will aim to have covered the following topics:
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on using Python lists and the NumPy package
An Excel workbook of five sheets, where each sheet is a dataset. We will use this workbook to demonstrate how to import Excel files into Python using Pandas.
This is a dataset resulting from phase 1 of the Human Microbiome Project (HMP). The HMP was a joint effort led by the US National Institutes of Health to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (a.k.a. environmental DNA) and 16S ribosomal RNA (rRNA) gene amplicon sequencing. The dataset that we will use holds the results of 16S rRNA gene sequencing, a technique where this gene is used as a marker for taxonomic identification and classification.
In very general terms, the microbial DNA is extracted from a sample, the 16S rRNA gene is amplified using the polymerase chain reaction (PCR), the amplicons are sequenced, and the resulting sequences are matched against a reference database. The results are presented as counts of Operational Taxonomic Units (OTUs), which for data interpretation are the equivalent of microbial species.
There are two datasets located in the data folder which we're interested in:
human_microbiome_project_metadata.csv
human_microbiome_project_otu_taxa_table.csv
These files, however, are excessively large - especially the metadata file. Therefore we'll be using subsets of those files called:
human_microbiome_project_metadata_subset.csv
human_microbiome_project_otu_taxa_table_subset.csv
This is a copy of our formatted and merged data - a smaller subset of the final dataset for doing some simple exploratory data analysis towards the end of class.
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so that we can see multiple code outputs per code cell.
random is a package with methods to add pseudorandomness to programs
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
os
pandas
matplotlib
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
!pip install openpyxl
Requirement already satisfied: openpyxl in c:\users\mokca\anaconda3\lib\site-packages (3.0.7) Requirement already satisfied: et-xmlfile in c:\users\mokca\anaconda3\lib\site-packages (from openpyxl) (1.1.0)
Wide and long (sometimes un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.
In a long-formatted dataset, each column is a variable and the results of each measured variable are stored in rows (observations). In contrast, in a wide-formatted dataset not every column is necessarily a variable, so you can have several observations of the same type of variable in the same row. The names long and wide come from the general shapes the two data formats have.
For data science applications, long format is preferred over wide format because it allows for easier and more efficient computation, subsetting and manipulation of data. Wide format is friendlier to the human eye and easier to work with when data need to be manually recorded or input. Therefore, the ability to interconvert between these two formats is a valuable and required skill. The following is a general scheme of wide- (left) and long-format (right) datasets:
To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.
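As a quick preview of the conversion we'll practise later, pandas' `melt()` function gathers wide-format columns into a variable/value pair of long-format columns. A minimal sketch with invented toy data (the column names here are not from today's datasets):

```python
import pandas as pd

# Hypothetical wide-format table: one row per subject,
# one column per measured time point
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "day_1": [5.1, 4.8],
    "day_2": [6.3, 5.0],
})

# melt() gathers the day_* columns into two long-format columns:
# one holding the variable name, one holding its value
long_df = wide.melt(id_vars="subject", var_name="day", value_name="measurement")
print(long_df)  # 4 rows: one per subject-day observation
```

Each row of the result is now a single observation, which is exactly the long/tidy shape described above.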
Why tidy data?
Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format), where it is easy to manipulate, model and visualize. Having a consistent data structure, and tools that work with that data structure, helps this process along.
Tidy data has:
each variable in its own column
each observation in its own row
each value in its own cell
This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.
The 5 most common problems with messy datasets are:
Fortunately there are some tools available to solve these problems.
In our case, we have two datasets that we've already imported with our packages:
hmp_metadata (data/human_microbiome_project_metadata.csv): a comma-separated file with information about the OTUs, such as the number of times an OTU was detected (count), the part of the body where the sample came from, the subject's sex, etc. 43,149 rows and 2,898 columns.
hmp_taxa (data/human_microbiome_project_otu_taxa_table.csv): a dataset with taxonomic information for each OTU. Seven rows and 43,141 columns.
We will use Pandas to merge the two files into a single dataset and will introduce concepts of data wrangling along the way.
These datasets are part of the R package HMP16SData. Details on how to prepare the data for download can be found in the HMP16SData vignette. To read more about the HMP project, visit https://www.hmpdacc.org/.
Given the popularity of Excel, it is common to find ourselves working with Excel workbooks. Before importing the file, we need to know our current working directory (where in the computer we are working from), change the directory if necessary, and list the files available for import. For those purposes we use the built-in os library.
os package provides access to the operating system
In order to access functions and files in your operating system, the os package provides utilities that facilitate navigating through directories and accessing system information. Here's a table with some helpful functions and their descriptions.
| Function | Description |
|---|---|
| getcwd() | Retrieve the current working directory. This is the location where Python thinks you're working from. This is usually the directory holding your Jupyter notebook. |
| chdir() | Change the current working directory. This can be an absolute path (i.e. C:/Users/yourname/) or relative to where you currently are (i.e. data/) |
| listdir() | List the files and folders in the current directory (default) or in the directory specified as a parameter. |
| mkdir() | Make a new directory (aka folder) in the current directory (default) or by providing a relative or absolute path. Note: you will get an error if attempting to make a directory that already exists. |
| rename() | Rename a folder or file. You can even move things to new directories as long as the path exists. |
| rmdir() | Remove an empty directory; gives an error if the directory does not exist or is not empty. |
| remove() | Remove a file (path) as long as it exists. |
# Import the os package. No need to rename it.
import os
# What is my current working directory?
# Use the print command to make it more readable
print(os.getcwd())
C:\Users\mokca\Dropbox\!CAGEF\Course_Materials\Introduction_to_Python\2021.06_Intro_Python\Lecture_03_Data_Formatting
# Change current working directory if needed
os.chdir("data/")
print(os.getcwd())
C:\Users\mokca\Dropbox\!CAGEF\Course_Materials\Introduction_to_Python\2021.06_Intro_Python\Lecture_03_Data_Formatting\data
# List files in directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_metadata.csv', 'human_microbiome_project_metadata_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_table.csv', 'human_microbiome_project_otu_taxa_table_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_taxa_metadata_merged.csv']
# Create or make a directory
os.mkdir("test_directory")
# Check out how that's changed the directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_metadata.csv', 'human_microbiome_project_metadata_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_table.csv', 'human_microbiome_project_otu_taxa_table_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_taxa_metadata_merged.csv', 'test_directory']
# Rename a directory or file
os.rename(src = "test_directory", dst = "renamed_directory")
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_metadata.csv', 'human_microbiome_project_metadata_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_table.csv', 'human_microbiome_project_otu_taxa_table_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'renamed_directory', 'subset_taxa_metadata_merged.csv']
os.rmdir("renamed_directory/") # Remove a directory
os.listdir()
['data_writing_test.csv', 'data_writing_test.tsv', 'data_writing_test.xlsx', 'human_microbiome_project_metadata.csv', 'human_microbiome_project_metadata_subset.csv', 'human_microbiome_project_otu_table.csv', 'human_microbiome_project_otu_taxa_table.csv', 'human_microbiome_project_otu_taxa_table_subset.csv', 'human_microbiome_project_taxa_table.csv', 'miscellaneous.xlsx', 'subset_taxa_metadata_merged.csv']
read_excel() to import your data
So we've covered an important series of steps that are helpful in converting our data. Sometimes data might have other issues, which is all part of data wrangling. To simplify our process, however, the pandas package provides a function that deals with headers and other issues when importing .xlsx files - and yes, it runs openpyxl under the hood.
By default read_excel() will use the first sheet in the workbook, but we can also use the sheet_name parameter with the actual sheet name or its position.
# Read the file in from the data folder to a DataFrame with the first sheet as default
pd.read_excel("miscellaneous.xlsx")
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
# Read a specific sheet in based on sheet name
pd.read_excel("miscellaneous.xlsx", sheet_name="All alphabetised")
| Title | No of mentions | |
|---|---|---|
| 0 | 1974 | NaN |
| 1 | 1977 | NaN |
| 2 | 1984 | NaN |
| 3 | 1984 | NaN |
| 4 | 1984 | NaN |
| ... | ... | ... |
| 1998 | Zazie in the Metro | 2.0 |
| 1999 | Zen and the Art of Motorcycle Maintenance | NaN |
| 2000 | Zen and the Art of Motorcycle Maintenance | 2.0 |
| 2001 | Zorba the Greek | NaN |
| 2002 | Zuleika Dobson | NaN |
2003 rows × 2 columns
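We saw the default sheet and a sheet selected by name; sheet_name also accepts a 0-based integer position. A minimal sketch - to keep it self-contained we first write a throwaway two-sheet workbook (the file and sheet names here are invented, not from the class materials) and then read the second sheet by position:

```python
import pandas as pd

# Build a small throwaway workbook so the example runs on its own;
# in class you would point read_excel() at miscellaneous.xlsx instead
with pd.ExcelWriter("position_demo.xlsx") as writer:
    pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"y": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

# sheet_name=1 means "the second sheet" (0-based position)
second_sheet = pd.read_excel("position_demo.xlsx", sheet_name=1)
print(second_sheet.columns.tolist())  # ['y']
```

Positions are convenient in a pinch, but as the next section shows, knowing the actual sheet names is usually safer.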
ExcelFile() class to get a list of sheet names
Although we can read in sheets by position, unless we know the sheet names a priori it is hard to call on a specific sheet by name. We already saw that this can be done quite easily with the openpyxl package, but we can also initialize an ExcelFile object, which has the .sheet_names property.
In fact, over the years, the ExcelFile.parse() method has become virtually identical to the read_excel() function if we were to take a look under the hood.
You can also select some specific sheets to read in
# Get a list of sheet names from our xlsx file
microbes_xl = pd.ExcelFile('miscellaneous.xlsx')
# What is this object?
type(microbes_xl)
# List the sheet names
microbes_xl.sheet_names
pandas.io.excel._base.ExcelFile
['microbes', 'Lists', 'All alphabetised', 'Top titles', 'dropoff']
# Take our Excel object and parse a specific sheet
microbes_xl.parse(sheet_name = "microbes")
# Close the file when we're done with it
microbes_xl.close()
# Here's equivalent code if you only want one sheet at a time
pd.ExcelFile('miscellaneous.xlsx').parse(sheet_name = "microbes")
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
| ASV | abundance | compound | salinity | group | replicate | kingdom | phylum | class | order | family | genus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 40.69 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas |
| 1 | GAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.71 | compound-free | brackish | control | 1 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodospirillales | Rhodospirillaceae | Candidatus |
| 2 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 11.13 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Lachnoclostridium |
| 3 | GAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTC... | 6.14 | compound-free | brackish | control | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Carnobacteriaceae | Trichococcus |
| 4 | GAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTC... | 3.97 | compound-free | brackish | control | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Proteiniphilum |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6651 | GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCC... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Pseudoalteromonadaceae | Pseudoalteromonas |
| 6652 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiaceae.4 | Caminicella |
| 6653 | GCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCG... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Vallitalea |
| 6654 | GCAGCAGTGGGGAATATTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Peptococcaceae | Desulfitibacter |
| 6655 | GCAGCAGTGGGGAATCTTGCGCAATGGGGGAAACCCTGACGCAGCA... | 0.00 | toluene | saline | treatment | 3 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Moorella |
6656 rows × 12 columns
Suppose you wanted to select multiple sheets. The read_excel() function will parse through a list [] of sheet indices and names and store them as entries in a dictionary, using the sheet_name entries as keys. With this parameter you can also:
pass None to open all sheets in your file
# Read in multiple sheets
microbes_xl_dict = pd.read_excel('miscellaneous.xlsx', sheet_name= [1, 2, "dropoff"])
# What is the object type?
type(microbes_xl_dict)
# Look at the keys
microbes_xl_dict.keys()
dict
dict_keys([1, 2, 'dropoff'])
microbes_xl_dict[2]
| Title | No of mentions | |
|---|---|---|
| 0 | 1974 | NaN |
| 1 | 1977 | NaN |
| 2 | 1984 | NaN |
| 3 | 1984 | NaN |
| 4 | 1984 | NaN |
| ... | ... | ... |
| 1998 | Zazie in the Metro | 2.0 |
| 1999 | Zen and the Art of Motorcycle Maintenance | NaN |
| 2000 | Zen and the Art of Motorcycle Maintenance | 2.0 |
| 2001 | Zorba the Greek | NaN |
| 2002 | Zuleika Dobson | NaN |
2003 rows × 2 columns
# load all sheets from a book
pd.read_excel('miscellaneous.xlsx', sheet_name=None, header=0).keys() # header uses 0-indexed row numbers
dict_keys(['microbes', 'Lists', 'All alphabetised', 'Top titles', 'dropoff'])
Next, we will import the HMP files we discussed at the top of class and reformat them in such a way that we can join them into a single, long-format Pandas DataFrame. The image below is a screenshot of the two files that make up the HMP dataset: one file with metadata and OTU counts, and one with the OTU-taxa information. How would you do the joining? What could we use as a key to join the two datasets? (Think of this question as "what do the two datasets have in common?".)
Importing and exploring your dataset, and designing a data wrangling strategy, are things that you should always do before you start reshaping your data.
Use the next 10 minutes to explore the files using Excel and to propose a data wrangling strategy or even start reshaping your data. Avoid scrolling down for now as the answers are in this notebook. Also, make sure to make copies of the two files and use those for your exploration. (Data inspection is also done in Python, of course, but for the sake of this exercise, use Excel.)
It's a bit hard to see but here's where we want to be at the end. Remember our end objective will be to merge these two datasets into a single long-format dataset.
We'll talk in more detail about how to get there but let's talk general strategy first.
Here is a general overview of how to convert the files into long format and join them into a single data frame:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset?
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Convert your dataset into long format to help conform with tidy data principles.
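The four steps above can be sketched end-to-end on a toy table (the column names and values here are invented stand-ins, not the real HMP columns):

```python
import pandas as pd

# Step 1: import and explore (here a toy stand-in for the metadata file)
toy = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "site": ["Oral", "Stool", "Oral", "Stool"],
    "notes": [None, None, None, None],   # an unusable, all-missing column
    "OTU_1": [0, 5, 2, 0],
    "OTU_2": [1, 0, 0, 3],
})
toy.info()

# Step 2: prototype the procedure on a small subset
subset = toy.head(2)

# Step 3: remove unwanted data (drop columns that are entirely NaN)
subset = subset.dropna(axis=1, how="all")

# Step 4: convert to long format - one row per sample-OTU observation
tidy = subset.melt(id_vars=["sample", "site"],
                   var_name="OTU", value_name="count")
print(tidy)
```

Once the procedure works on the subset, the same code can be rerun on the full dataset.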
Data preprocessing, a.k.a. data wrangling or data munging, is the phase in which data are prepared for downstream analysis. It involves data cleaning, formatting, handling of special characters in datasets, etc. Preprocessing datasets can be labor intensive depending on the initial state of the data, as we will see today, but Pandas simplifies this process to a great extent. "Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive."
Let's get down to it. First, import Pandas and NumPy.
import numpy as np
# We have already imported pandas but do it again so the sections are independent
import pandas as pd
read_csv()
Now we are all set up to tackle file 1: human_microbiome_project_metadata_subset.csv. This file contains the OTU counts and metadata (data about the data). We'll import the data with the pandas read_csv() function, which produces a DataFrame object. Some useful parameters include:
filepath_or_buffer: the source or location of our data
sep or delimiter: a string denoting how your data are separated, usually , (default), \s+ (whitespace) or \t (tab)
header: the row number(s) to use as the column names
index_col: the column(s) to use as the index (row labels) for your DataFrame
na_values: a list of values to be considered NaN in addition to a comprehensive default list
# A quick tour through the read_csv function
pd.read_csv?
# na_values=' ' makes empty cells into NaN
hmp_metadata = pd.read_csv("human_microbiome_project_metadata_subset.csv", na_values=' ')
# Take a peek at the table
hmp_metadata.head()
| Unnamed: 0 | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | OTU_97_1 | ... | OTU_97_1182 | OTU_97_11820 | OTU_97_11822 | OTU_97_11823 | OTU_97_11824 | OTU_97_11825 | OTU_97_11826 | OTU_97_11827 | OTU_97_11828 | OTU_97_11829 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 700013549 | 158013734 | 1 | Female | BCM | Gastrointestinal Tract | Stool | SRS012191 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 700014386 | 158398106 | 1 | Male | BCM,BI | Gastrointestinal Tract | Stool | NaN | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 700014403 | 158398106 | 1 | Male | BCM,BI | Oral | Saliva | NaN | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 700014409 | 158398106 | 1 | Male | BCM,BI | Oral | Tongue Dorsum | NaN | 0 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 700014412 | 158398106 | 1 | Male | BCM,BI | Oral | Hard Palate | NaN | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 2000 columns
NaN
Taking a cursory glance at the DataFrame, we can see the Unnamed column we mentioned in our strategy, but we also see that the 8th column, SRS_SAMPLE_ID, has a number of NaN values. Some information in that column is missing, so it was imported with a value of NaN - perhaps because some samples were never submitted to the NCBI database, or for some other reason - so we will take no action.
We should note that by default read_csv() and similar commands will recognize not only empty values ('') but other common variants like NA, N/A, etc., and give these the NaN value. Furthermore, as in our command above, you can set additional values to be recognized as NaN using the na_values parameter; in our case, we set a single space ' ' as a recognizable missing value. You can even pass a dictionary of missing-value characters specific to each column. We'll talk more about missing data soon.
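A minimal sketch of the per-column dictionary form (the CSV contents and sentinel values below are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV where each column flags missing data differently:
# the text column uses "missing", the numeric column uses -999
raw = io.StringIO(
    "sample,body_site,count\n"
    "s1,missing,10\n"
    "s2,Oral,-999\n"
)

# A dict maps each column name to the extra values that mean
# "missing" in THAT column only
df = pd.read_csv(raw, na_values={"body_site": ["missing"], "count": [-999]})
print(df.isna().sum())  # one NaN in body_site, one in count
```

This keeps a sentinel like -999 meaningful in columns where it is a real value, while treating it as missing where it isn't.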
For now, let's take a closer look at the SRS_SAMPLE_ID column. Recall we can index columns by name (.loc[:, colName]) or by position (.iloc[:, position]).
# Let's take a look at just a few entries from column 8
hmp_metadata.iloc[:, [8]].head(10)
# Recall what providing a [] list to indexing does?
| SRS_SAMPLE_ID | |
|---|---|
| 0 | SRS012191 |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
| 4 | NaN |
| 5 | NaN |
| 6 | NaN |
| 7 | NaN |
| 8 | NaN |
| 9 | NaN |
As mentioned before, inspecting your data to understand aspects such as types, missing values, and shape is a crucial part of data preprocessing. Recall that methods and properties like info(), head(), and .shape are valuable starting points.
# Grab some summary information on our DataFrame
hmp_metadata.info()
# If a DataFrame has too many columns, only summarised information will be printed
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Columns: 2000 entries, Unnamed: 0 to OTU_97_11829 dtypes: int64(1995), object(5) memory usage: 15.3+ MB
# What are the dimensions of our data?
hmp_metadata.shape
# shape is a property of DataFrames, which is why we don't use parentheses
(1000, 2000)
isna() and any() methods
Although we quickly found NaN values in one of our columns using just a cursory inspection, there may be other NaN values that we haven't seen or expected. We can check for more NaN values with the isna() DataFrame method, which returns a same-sized object where any NaN or similar values are mapped to True.
To accomplish this task we'll also apply the DataFrame.any() method in-line with our isna(). The any() function is a core Python function that checks whether any of the elements within an iterable (such as a list) are True. In the case of the DataFrame class, the any() method evaluates along a specific axis. The any() method parameters include:
axis: indicates which axis should be reduced.
0 or index (default) reduces the index dimension and returns a Series whose index is the original column labels
1 or columns reduces the columns to return a Series whose index is the original index
None reduces all dimensions to return a single boolean scalar
bool_only: include only boolean columns; default is None (use everything)
We'll also use the .columns property for the first time, which returns an Index object containing all of the column names.
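A toy frame makes the axis behaviour concrete (the data here are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [1.0, 2.0]})

# axis=0 (default): collapse the rows -> one boolean per column
print(toy.isna().any())           # a: True, b: False

# axis=1: collapse the columns -> one boolean per row
print(toy.isna().any(axis=1))     # row 0: False, row 1: True

# axis=None: collapse everything -> a single scalar
print(toy.isna().any(axis=None))  # True
```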
So, how many columns contain NaN?
# Call isna() on just the DataFrame
hmp_metadata.isna()
| Unnamed: 0 | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | OTU_97_1 | ... | OTU_97_1182 | OTU_97_11820 | OTU_97_11822 | OTU_97_11823 | OTU_97_11824 | OTU_97_11825 | OTU_97_11826 | OTU_97_11827 | OTU_97_11828 | OTU_97_11829 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 996 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 998 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1000 rows × 2000 columns
# Apply the any() function to the same call
hmp_metadata.isna().any()
type(hmp_metadata.isna().any())
Unnamed: 0 False
PSN False
RSID False
VISITNO False
SEX False
...
OTU_97_11825 False
OTU_97_11826 False
OTU_97_11827 False
OTU_97_11828 False
OTU_97_11829 False
Length: 2000, dtype: bool
pandas.core.series.Series
# Pull up a list of columns from our DataFrame
hmp_metadata.columns
type(hmp_metadata.columns)
Index(['Unnamed: 0', 'PSN', 'RSID', 'VISITNO', 'SEX', 'RUN_CENTER',
'HMP_BODY_SITE', 'HMP_BODY_SUBSITE', 'SRS_SAMPLE_ID', 'OTU_97_1',
...
'OTU_97_1182', 'OTU_97_11820', 'OTU_97_11822', 'OTU_97_11823',
'OTU_97_11824', 'OTU_97_11825', 'OTU_97_11826', 'OTU_97_11827',
'OTU_97_11828', 'OTU_97_11829'],
dtype='object', length=2000)
pandas.core.indexes.base.Index
# Take our series of booleans and retrieve ONLY the column names that are true
hmp_metadata.columns[hmp_metadata.isna().any()]
# What is the object we get back?
type(hmp_metadata.columns[hmp_metadata.isna().any()])
# How long is our list?
len(hmp_metadata.columns[hmp_metadata.isna().any()]) # only one column (8) contains NaN
Index(['SRS_SAMPLE_ID'], dtype='object')
pandas.core.indexes.base.Index
1
tolist() method to convert your objects
From above we see that all of our code creates an Index object, which is part of the pandas package. It has some extra properties and methods that might be useful, but we can also simply convert it to a plain Python list if all we care about is the column names. To accomplish this we'll use the built-in method tolist(), which will iterate through the object and produce a simple list of Python scalars, or pandas scalars for more complex values like Timestamps or intervals.
# Use the tolist() function
hmp_metadata.columns[hmp_metadata.isna().any()].tolist()
# Cast the information to a simple list
list(hmp_metadata.columns[hmp_metadata.isna().any()])
['SRS_SAMPLE_ID']
['SRS_SAMPLE_ID']
.loc[] and .iloc[] can accept boolean lists as well
So we have a way to get a list of columns with NaN values. In fact, we can use similar methods to retrieve column names for any columns that meet conditional criteria we are interested in. Suppose we wanted to cull a DataFrame down to just the information we are explicitly interested in?
While we discussed the uses of .loc[] and .iloc[] in the context of having column names and index positions, both of these methods can accept boolean series/arrays/lists as a means to filter for specific columns. Remember that multi-indexing uses a [row, column] format, so how do we get just our columns with NaN data?
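As a quick sketch of boolean column selection with .loc[] (toy data, not the course file):

```python
import numpy as np
import pandas as pd

# Small stand-in for hmp_metadata; only column "y" contains a NaN
df = pd.DataFrame({"x": [1, 2],
                   "y": [np.nan, 5.0],
                   "z": ["a", "b"]})

# One boolean per column: does the column contain any NaN?
mask = df.isna().any()

# [row, column] format: all rows (:), only the columns where mask is True
nan_cols = df.loc[:, mask]   # keeps just column "y"
```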
# Use `.loc` to retrieve a slice of our DataFrame with columns containing NaN values
hmp_metadata.loc[:, hmp_metadata.isna().any()]
| | SRS_SAMPLE_ID |
|---|---|
| 0 | SRS012191 |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
| 4 | NaN |
| ... | ... |
| 995 | SRS018692 |
| 996 | SRS018694 |
| 997 | SRS018696 |
| 998 | SRS018698 |
| 999 | SRS018700 |
1000 rows × 1 columns
Much like the isna() method, the DataFrame class has additional methods for checking for null value types:
isnull(): actually just an alias of isna()
notna(): the boolean inverse of isna()

# also returns booleans, like isna()
hmp_metadata.isnull().any()
# How many columns have null values?
len(hmp_metadata.columns[hmp_metadata.isnull().any()])
# Remember these are booleans so we can sum them!
hmp_metadata.isnull().any().sum()
# Which ones?
hmp_metadata.columns[hmp_metadata.isnull().any()].tolist()
Unnamed: 0 False
PSN False
RSID False
VISITNO False
SEX False
...
OTU_97_11825 False
OTU_97_11826 False
OTU_97_11827 False
OTU_97_11828 False
OTU_97_11829 False
Length: 2000, dtype: bool
1
1
['SRS_SAMPLE_ID']
# How many columns contain non-NaN values?
hmp_metadata.notna().any().sum()
2000
What did we just do above there? While we've been encountering some similar examples in this section, we haven't really addressed this coding style. One of the great things about working with an object-oriented language like Python is that we can call methods in tandem, also known as chaining. Since many methods or properties return an object, as long as we know what kind of object was returned, we can stack method calls on the resulting object. This is especially helpful in the pandas package, where core functions are often present as class methods.
From above:
hmp_metadata.isna() returns a DataFrame, upon which we call...
DataFrame.any(), which returns a Series, upon which we call...
Series.sum(), which returns a scalar in this case
Of course, you need to know which objects you are dealing with and their methods, but it makes a pretty simple line of code!
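The chain described above can be sketched on a toy DataFrame (hypothetical values, not the course data):

```python
import numpy as np
import pandas as pd

# Columns "a" and "b" contain NaN; column "c" does not (hypothetical values)
df = pd.DataFrame({"a": [1.0, np.nan],
                   "b": [np.nan, np.nan],
                   "c": [3.0, 4.0]})

# DataFrame -> DataFrame -> Series -> scalar, one method call at a time
n_cols_with_nan = df.isna().any().sum()   # 2 columns contain NaN
```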
Now that we've explored our metadata file a little bit and know that there are some missing values in the SRS_SAMPLE_ID column, we can proceed to the next step of our journey: reshaping the data. Recall that we currently have some ~2000 columns representing operational taxonomic units (OTU) and a quantification of how many of those units are present within a sample.
hmp_metadata: take a subset
A recap of the formatting to be done on the metadata file:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Convert your dataset into long format to help conform with tidy data principles.
As mentioned previously, developing your preprocessing code on a full, large dataset is a rather inefficient strategy; not so bad if you have access to vast computational power, but still... There is quite a bit of trial and error involved in finding out what works best on your data, and the larger the dataset, the longer it will take for the output to be generated for you to inspect and evaluate.
Instead, we take a representative sample of the full dataset, write code to preprocess it, and then apply the program we wrote to the full dataset. For example, the full hmp_metadata file has 43,149 rows and 2,898 columns, where columns 9 and onwards (0-indexed) are OTUs. Four OTU columns will be enough for data wrangling, so we can ignore the vast majority of the OTU columns for now. The easiest way to subset data is with .iloc[] or .loc[].
# Grab the first 100 rows of data and 14 columns
hmp_metadata_subset = hmp_metadata.iloc[...]
# View our subset
hmp_metadata_subset
random.choice() function
Sure, we could simply take the first chunk of our data, but we can also take a random subset that is representative of our data set. Taking random subsets is especially helpful if you are running certain tests on data for machine learning or programming/testing new algorithms.
To accomplish our subsetting goals, we will use Numpy's random.choice() function to generate a representative subset of hmp_metadata. Of course we'll still want those first 9 columns - it's the OTU data that we want to randomize.
import numpy as np
np.random.choice?
Now that we know more about random.choice(), let's use it. Given that the a argument expects a 1-D array or an integer (see random.choice's documentation), we will use the shape attribute.
np.random.seed(seed = 1) # The number to which you "set the seed" must be the same in order to obtain the same results (reproducibility)
# Let's start by sampling 100 rows (shape[0])
np.random.choice(a = hmp_metadata. ..., size = 100, replace = False)
# Now sample 14 columns (shape[1])
np.random.choice(a = hmp_metadata. ..., size = 14, replace = False)
Done. Subsampling from both rows and columns seems to be working as expected, so now we can create a subset. How do we use those two lines of code to sample from hmp_metadata without ending up with just index positions as the final output?
# We use square brackets (.iloc[]) to subset hmp_metadata and get actual values instead of index positions
hmp_metadata_subset = hmp_metadata.iloc[np.random.choice(a = hmp_metadata.shape[0], size = 100, replace = False), #rows
np.random.choice(a = hmp_metadata.shape[1], size = 14, replace = False)] #cols
hmp_metadata_subset. ...
np.arange()
Not quite what we wanted... We did not catch the first nine columns, and we need those. On a side note, seeing only OTU columns is simply a matter of probability, as there are thousands of OTU columns versus 9 non-OTU columns. Remember, we are merely giving .iloc[] a list of index positions. All we have to do is build a static portion and a random portion and put them together.
To achieve this we'll use the NumPy arange() function. No, that's not a typo. Python 3 already has a range() function that returns a lazy range object for the range specified. NumPy's version returns an array for the range supplied, from start (inclusive) to end (exclusive).
To put two arrays together, we use the np.append() function. Let's give it a try.
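As an illustration with arbitrary ranges (not the course's actual index positions), np.arange() and np.append() combine like this:

```python
import numpy as np

# np.arange() materializes the range as an array: start inclusive, end exclusive
first = np.arange(0, 5)      # array([0, 1, 2, 3, 4])
second = np.arange(10, 13)   # array([10, 11, 12])

# np.append() concatenates the two arrays into a single index list
combined = np.append(first, second)
print(combined)              # [ 0  1  2  3  4 10 11 12]
```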
# Make two ranges and append them together
np.append(np.arange(0, 5), ...)
# Set our seed for consistency
np.random.seed(seed = 1)
# Make our multi-index call
hmp_metadata_subset = hmp_metadata.iloc[np.random.choice(a = hmp_metadata.shape[0], size=100, replace = False), # Rows
np.append(np.arange(...), # Grab the first 9 columns
# Randomly select 5 more columns after that
np.random.choice(a = np.arange(9, ...),
size = 5, replace = False)
) # End our append call
] # Multi-index call
# Check it's dimensions
hmp_metadata_subset.shape
# View the DataFrame
hmp_metadata_subset.head(10)
sample() method as an alternative to np.random.choice()
An alternative to random.choice() for random subsampling of either rows or columns is the sample() method. Using its parameters you can decide on:
n: the number of items from axis to return.
frac: an alternative to a number of items; take a fraction of the total items. Cannot be used with n.
axis: either 0 for rows (default for Series and DataFrames), or 1 for columns.
replace: allow or disallow sampling of the same item more than once.
random_state: the parameter for setting a random seed for the pseudorandom generator. Otherwise it sets its own.

# Use the Jupyter help window instead of printing the information to an output cell
hmp_metadata_subset.sample?
# Pull 100 rows from our original DataFrame object
hmp_metadata.sample(n = 100, replace=False, weights=None, ...).head()
# weights=None means all rows have equal probability of being picked
# We can also subset by a fraction of the dataset
hmp_metadata.sample(..., replace = False, weights=None, random_state=1, axis=0)
# 3.45% of rows (axis 0) and all columns.
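To make the sample() parameters concrete, here is a minimal sketch on a toy DataFrame (hypothetical values, not the course data):

```python
import pandas as pd

# Ten rows, two columns of hypothetical values
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# n: a fixed number of rows; random_state makes the draw reproducible
three_rows = df.sample(n=3, random_state=1)

# frac: a fraction of the rows instead of a count (20% of 10 rows -> 2 rows)
fifth = df.sample(frac=0.2, random_state=1)

# axis=1: sample columns instead of rows
one_col = df.sample(n=1, axis=1, random_state=1)
```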
Before we continue let's look deeper at our data subset to see what our data is like. We could have done this with our larger dataset but it would likely be overwhelming. We can use properties like .dtypes to find out the data types for our columns, and the method describe() to produce summary statistics for those columns too.
# What kind of data types are we working with?
hmp_metadata_subset. ...
# Generate summary statistics for our set
hmp_metadata_subset. ...
# count is the number of non-missing values
describe() method works for numeric information
From above, we see that our summary statistics do not include the SEX, RUN_CENTER, HMP_BODY_SITE, HMP_BODY_SUBSITE, or SRS_SAMPLE_ID columns. Given that these are character-based, the describe() method cannot generate numeric summary statistics for them.
If, however, we explicitly provide non-numeric column(s) to describe() we are returned a truncated or altered version of the output. You can provide a single or multiple character columns with similar results.
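A small sketch of the difference (toy columns, not the course data):

```python
import pandas as pd

# One character-based and one numeric column (hypothetical values)
df = pd.DataFrame({"site": ["gut", "oral", "oral", "skin"],
                   "depth": [10, 20, 30, 40]})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = df["depth"].describe()

# Object column: count, unique, top (most frequent value), freq
object_summary = df["site"].describe()
```

For the object column above, count is 4, unique is 3, top is "oral", and freq is 2.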
# Provide a single column
hmp_metadata_subset.iloc[:, ...].describe() # We get a count of how many non-na values are in the column
# Provide multiple columns
hmp_metadata_subset.iloc[:, ...].describe()
unique() method to examine values in your DataFrame
As we can see from above, there are 18 unique HMP_BODY_SUBSITE values. We can see what they are using the unique() method on a column from our DataFrame. This is a quick way to inspect the nature of the values in your dataset as well. Let's give it a try.
# Look at the subsite information
hmp_metadata_subset["HMP_BODY_SUBSITE"] ...
NumPy arrays with the .values property
As another way to access some or all of the values in your DataFrame, you can use the .values property, which you can also subset with multi- or chained-indexing.
# Pull down the values from our DataFrame
hmp_metadata_subset.values # easier to see when we have smaller datasets.
# Subset the values
hmp_metadata_subset.values[...]
So we've discussed a few ways to generate subsets from our data. This is helpful when generating code to reshape our data because the original information can be extremely dense and, in some cases, computationally intensive to just "play" with. We've also generated some summary information on our subsets, just to help us sanity-check for larger issues we may have overlooked when initially importing the data.
Let's see where we are in our steps:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Unnamed: 0 column. It looks to be an artefact from when the file was originally exported. It's not a column we need and we should remove it.
Convert your dataset into long format to help conform with tidy data principles.
Up until this point we haven't really altered any of the data we imported. As per our data wrangling plan, however, the next step is to remove column 0 from hmp_metadata. It is just a column with indices starting at 1 that was created when the file was written to disk (saved).
Before we start deleting anything from the dataset, inspect the dataset and pull out a list with column names. The fastest way to retrieve a list of column names is to cast the DataFrame as a list().
# Make a quick copy of our subset
hmp_metadata_copy = hmp_metadata_subset.copy()
# Look at the first 5 rows
hmp_metadata_subset.head()
# Cast the DataFrame to a list
list(hmp_metadata_subset)[...] # Print just 10 column names, otherwise it is a very long list
drop() method
As usual, there are several ways to achieve the same goal. In this case, we want to remove the first column, Unnamed: 0, from our DataFrame. We can remove that column by its index using the drop() method from the DataFrame class. This is accomplished in a single call or one-liner (a nickname for programs that are one line long).
You can complete this code using the index position or the column name itself.
# Drop by a column by its index
hmp_metadata_subset_drop = hmp_metadata_subset. ...(hmp_metadata_subset.columns[[0]], axis = 1)
# this syntax removes whatever column is at index 0 every time you run it, so prefer dropping by name (safer)
hmp_metadata_subset_drop.head()
Even though the above code does its job as expected, it is dangerous. Since it uses indexing to locate the column we wish to remove, once the current column 0 is removed, all other columns shift one position to the left. As a consequence, another column is now at index 0, and if you run this piece of code again, it will keep removing whatever column is at index 0. You can end up removing many more columns than you intended, with potentially serious implications for your analyses.
Thus, it is safer to use syntax that explicitly looks for the column of interest, such as using the full column name or partial matching (more on that in lecture 5).
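A toy comparison of the two drop() styles (hypothetical columns mimicking our metadata):

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [1, 2],
                   "PSN": ["p1", "p2"],
                   "SEX": ["f", "m"]})

# Safer: drop by label; running it a second time raises a clear KeyError
# instead of silently removing a different column
by_label = df.drop(labels="Unnamed: 0", axis=1)

# Riskier: drop by position; which column goes depends on the current layout
by_position = df.drop(df.columns[[0]], axis=1)
```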
# Restore our subset since we've already altered it
hmp_metadata_subset = hmp_metadata_copy.copy()
# Remove a column by its name/label
hmp_metadata_subset.drop(labels=..., axis=1, inplace=...) # labels are indices names for either columns or rows.
# inplace overwrites the DataFrame instead of us explicitly overwriting the variable
# del my_dataframe['column_name'] # This is another way to remove the column by its name
hmp_metadata_subset.head()
If we try to run this same code again, we get a KeyError because there is no column called "Unnamed: 0" anymore.
hmp_metadata_subset.drop(labels='Unnamed: 0', axis=1)
Now that we've removed the Unnamed: 0 column from our DataFrame, let's see where we are in our steps:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed. $\checkmark\checkmark$
Unnamed: 0 column. It looks to be an artefact from when the file was originally exported. It's not a column we need and we should remove it.Convert your dataset into long format to help conform with tidy data principles.
melt() function
We can "melt" all of the OTU variable values into a single column. This essentially means we are going to compress all of the OTU column information into two columns. The first column holds information like a variable name (i.e., the OTU code), while the second column holds the value. Thus a new observation (row) is added for each variable (column) we are melting. Another important aspect of this step is that the unmelted information (untouched columns) in each observation must be unique to that observation. In our case, columns 0-7 hold the information that will remain untouched. That data will now be repeated multiple times for each new variable that is converted to an observation.
We'll accomplish this with the pandas function melt(), or the identical DataFrame method of the same name. The key parameters to keep in mind are:
frame: the DataFrame we want to melt
id_vars: a tuple/list/array of column name(s) to use as identifiers. These are the unique columns in our data that differentiate between observations. They will be "unchanged" by the melt.
value_vars: a tuple/list/array of column name(s) to use as variable columns. If None, then all non-id_vars columns will be used as variable columns.
var_name: the name for your variable column; defaults to variable.
value_name: the name for your variable values column; defaults to value.

pd.melt?
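Before melting the real subset, here is melt() on a toy wide table (hypothetical counts, not the course data):

```python
import pandas as pd

# One identifier column and two OTU columns to melt (hypothetical counts)
wide = pd.DataFrame({"PSN": ["p1", "p2"],
                     "OTU_97_1": [5, 0],
                     "OTU_97_2": [2, 7]})

# Each melted column contributes one row per identifier:
# 2 ids x 2 melted columns = 4 rows, 3 columns (PSN, OTU, count)
long = pd.melt(wide, id_vars=["PSN"],
               var_name="OTU", value_name="count")
```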
# Remind ourselves what the DataFrame looks like
hmp_metadata_subset.head()
# How big is our subset again?
hmp_metadata_subset.shape
# Take a quick look at the column names again
hmp_metadata_subset.columns[...]
# Whatever column not in this list will be turned into the new OTU column
# Call on the melt function
...(frame = hmp_metadata_subset,
id_vars= hmp_metadata_subset.columns[0:8],
var_name=...,
value_name=...)
# Call on the melt method from the DataFrame object itself
...(id_vars= hmp_metadata_subset.columns[0:8],
var_name="OTU",
value_name="count")
melt() function
Can you explain why our melted data frame is 500 rows × 10 columns?
If you wanted to keep a subset of identifier columns that were not strictly grouped together you have a few options for creating that list to pass on to the id_vars parameter of melt():
# Create a list of variables
my_id_variables = ['PSN','RSID','VISITNO','SEX','RUN_CENTER','HMP_BODY_SITE','HMP_BODY_SUBSITE','SRS_SAMPLE_ID']
# Melt the subset
pd.melt(hmp_metadata_subset,
id_vars= my_id_variables,
var_name= 'OTU',
value_name= 'COUNT')
# If you have too many column names to type, you can use slicing and pass a slice of the DataFrame to id_vars
my_id_variables = hmp_metadata_subset.loc[: , ...]
# Melt the subset
hmp_metadata_subset_melt = pd.melt(hmp_metadata_subset,
id_vars= my_id_variables,
var_name= 'OTU',
value_name= 'COUNT')
hmp_metadata_subset_melt.head()
print()
print(hmp_metadata_subset_melt.OTU.unique())
And... that was the last change to hmp_metadata. Now that we have the file in the shape we wanted, let's preprocess the entire dataset by combining all of the code snippets we've generated.
import pandas as pd
# hmp_metadata = pd.read_csv("data/human_microbiome_project_metadata.csv")
# 1. Make a copy of the full dataset to avoid re-reading in the large metadata file
# in case something goes wrong; you never know...
hmp_metadata_copy = hmp_metadata.copy()
# Inspect the column names in dataframe
# list(hmp_metadata_copy)
# 2. Remove "Unnamed 0" column
hmp_metadata_copy_drop = hmp_metadata_copy.drop(labels='Unnamed: 0', axis=1)
# hmp_metadata_renamed_column0Removed = hmp_metadata.copy()
# 3. Make a list of the column names that we want to preserve as identifier columns. All other columns will be used to populate the melted version of the data frame
my_id_variables = hmp_metadata_subset.loc[ : , 'PSN':'SRS_SAMPLE_ID']
# 4. Melt the data frame (convert into long format)
hmp_metadata_tidy = pd.melt(hmp_metadata_copy_drop,
id_vars= my_id_variables,
var_name= "OTU",
value_name= 'count')
# Take a peek at the data
hmp_metadata_tidy.head(n=20)
# Learn some information about what we have now
hmp_metadata_tidy.info()
One file down. One more to go.
hmp_taxa
Recall our general steps:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset?
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Convert your dataset into long format to help conform with tidy data principles.
Remember that the ultimate goal is to join the metadata and OTU-taxa datasets. Let's start by importing and inspecting the OTU taxa set.
# Read in our file
hmp_taxa = pd.read_csv(...)
# Take a quick look at the DataFrame
hmp_taxa.head()
# Additional information about our DataFrame
hmp_taxa.info()
Check for NaN values within our data.
# Check for NaN (boolean, would be the same as we did for metadata file)
hmp_taxa.isna().any()[:10] # Show the NaN status of the first ten columns
# Here's some equivalent code
...(hmp_taxa).any()[:10] # Show the NaN status of the first ten columns
NaNs are much more abundant in this dataset compared to hmp_metadata. How many columns with NaN values are we dealing with?
# Return the columns with NA values in them
...[:, hmp_taxa.isna().any()]
hmp_taxa
We've imported and taken a look at our data. So 198 or so columns contain incomplete data, but we'll just leave those for now.
Recall our OTU-taxa file plan:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Convert your dataset into long format to help conform with tidy data principles.
Let's take just a small subset of taxa columns to work with.
# Create a subset using the copy() method.
hmp_taxa_subset = hmp_taxa.iloc[ : , :10]. ... # Nine OTU columns are enough for our purpose.
# What happens if we don't use copy()?
We've imported and taken a look at our data, and subsetting was quite straightforward.
Recall our OTU-taxa file plan:
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.
Rename the Unnamed: 0 column to rank_name. This is an extra modification, extra because it is not indispensable.
Convert your dataset into long format to help conform with tidy data principles.
Rename a column in hmp_taxa with the rename() method
Renaming columns in our DataFrame is pretty straightforward but does require some thought. Renaming a single column is normally a simple case, but what happens if we want to rename multiple columns? The pandas package lets us rename multiple DataFrame indices (rows) OR columns with the rename() method, but we should take careful note of the parameters:
mapper: a dictionary-like object or function to apply to the axis values. This links the values we want to change to their new values.
axis: 0 for index or 1 for columns
columns: an alternative to the mapper and axis parameters to explicitly rename just columns
inplace: a boolean which will either directly change the DataFrame and return nothing (True) or return an updated DataFrame object (False).
Let's start by renaming that column.
# change "Unnamed: 0" to rank_name
hmp_taxa_subset.rename(columns = ..., inplace=True) # Notice that rename takes a dict as input
# Show some of the rows
hmp_taxa_subset.head(7)
hmp_taxa_subset
So we now have a subset of our data, and we've renamed that first column to rank_name. It's time to remove the first row from our dataset.
From hmp_taxa_subset, let's remove row 0 (consensus lineage). The OTU-taxa dataset is redundant: row 0 (consensus lineage) contains the same information as rows 1 to 6. This redundancy means that we have two alternatives for converting taxonomic ranks into columns and OTUs into a single column (as we did with hmp_metadata):
Use only row 0, split the contents of each cell based on special characters such as colons and underscores, use the single letters as taxonomic ranks (e.g. c stands for "class"), and then melt and populate the dataset with the contents of the OTU columns. This approach requires the use of regular expressions (regex), and we are not quite there yet. We will revisit this same exercise in week 5 (regex) and will use this option to format this same dataset.
Remove "consensus lineage" from the dataset and use the remaining six rows and all OTU columns. This alternative is more suitable for our Python programming skills at this point, so that is what we will use.
Remember how we used the drop() method to remove a column from hmp_metadata? Now we can go ahead and use the same method to drop a row as well. By default it will assume you are referring to row indices (axis = 0) when you provide an index as an integer.
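A minimal sketch of dropping a row by its index label (toy rows standing in for hmp_taxa_subset):

```python
import pandas as pd

# Row 0 plays the role of the redundant "consensus lineage" row (toy values)
df = pd.DataFrame({"rank_name": ["consensus lineage", "SUPERKINGDOM"],
                   "OTU_97_1": ["k__Bacteria", "Bacteria"]})

# With axis=0 (the default), drop() treats labels as row index entries
trimmed = df.drop(labels=0, axis=0)   # only the SUPERKINGDOM row remains
```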
# Remove row 0 (consensus lineage)
hmp_taxa_subset.drop(labels = ..., axis = ..., inplace = True)
hmp_taxa_subset
We've now completed some data reformatting and trimming, and we're on to the final step!
Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$
Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$
Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed. $\checkmark\checkmark$
Rename the Unnamed: 0 column to rank_name. This is an extra modification, extra because it is not indispensable.
Convert your dataset into long format to help conform with tidy data principles.
Melt hmp_taxa_subset into a 3-column dataset
There is only one column, rank_name, that we need to keep as an identifier. All of the other columns are OTUs that we'll want to melt into OTU and taxonomic_level.
What will our melted DataFrame look like?
| rank_name | OTU | taxonomic_level |
|---|---|---|
| SUPERKINGDOM | OTU_97_1 | Bacteria |
| PHYLUM | OTU_97_1 | Firmicutes |
| CLASS | OTU_97_1 | Bacilli |
| ORDER | OTU_97_1 | Lactobacillales |
| FAMILY | OTU_97_1 | Lactobacillaceae |
| GENUS | OTU_97_1 | Lactobacillus |
| ... | ... | ... |
Challenge
Melt hmp_taxa_subset and create a new dataframe called hmp_taxa_subset_melt.
# Use the pandas melt() function on hmp_taxa_subset
hmp_taxa_subset_melt = pd.melt(hmp_taxa_subset,
id_vars = ...,
var_name = ...,
value_vars = hmp_taxa_subset.columns[...], # optional
# Column(s) to unpivot otherwise all non-id_vars columns will be converted
value_name = "taxonomic_level")
# Take a peek at the melted DataFrame
hmp_taxa_subset_melt.head(12)
# Retrieve some information on our data
hmp_taxa_subset_melt.info()
pivot() method
Looking at our melted data, we have now converted the OTU columns into rows. Taking a closer look, however, is the information in a format that we want to work with? When we think of the identifier information from our other table, hmp_metadata, it is hard to imagine how we'll combine the tables (our ultimate goal).
Furthermore, keeping taxonomic ranks as rows will bring difficulties later on when we try to access the data for statistics and plotting. This format also generates six rows per OTU (six taxonomic ranks) instead of a single row per OTU. You can see that what are essentially identifiers of the same "observation" are therefore spread across 6 rows. This doesn't exactly follow our philosophy on tidy data.
Instead, let's convert the information in rank_name to column names, which will convert the redundant OTUs column values into a column of unique values, with the taxonomic_level values filling out the new columns made from rank_name.
To accomplish our conversion goals, we'll use the pivot() method which has the following relevant parameters:
index: the column used to make the new DataFrame's index. Choosing this can be crucial if we plan on joining with another DataFrame.
columns: the column from which you are going to retrieve new column names
values: the column(s) to use for populating the new column values
Recall that we can directly call on column names as though they are attributes/properties of the DataFrame. As a sanity check, let's see what our new column names would be.
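To preview the pivot on a toy long table (hypothetical rows with the same three columns as our melted data):

```python
import pandas as pd

# Two OTUs, two ranks each, in long format (hypothetical values)
long = pd.DataFrame({
    "rank_name": ["PHYLUM", "CLASS", "PHYLUM", "CLASS"],
    "OTU": ["OTU_97_1", "OTU_97_1", "OTU_97_2", "OTU_97_2"],
    "taxonomic_level": ["Firmicutes", "Bacilli",
                        "Proteobacteria", "Gammaproteobacteria"]})

# OTU values become the index, rank_name values become columns,
# and taxonomic_level fills in the cells
wide = long.pivot(index="OTU", columns="rank_name",
                  values="taxonomic_level")
```

Note that the pivoted columns come out in alphabetical order (CLASS before PHYLUM), which the course addresses by reordering afterwards.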
# Generate a unique list of values from the rank_name columns
# These will be new columns in our hmp_taxa DataFrame
hmp_taxa_subset_melt ...
pivot() method...¶Challenge
We have 6 unique ranks, and each will become its own column once we "pivot" the melted DataFrame. Use the pandas DataFrame pivot() method on the hmp_taxa_subset_melt DataFrame, making sure that the OTU values become the index for the new dataset - this is key and will be used when joining the two datasets.
# Use pivot with the dataset
hmp_taxa_subset_melt_pivot = hmp_taxa_subset_melt.pivot(index = ...,
columns = ...,
values = ...)
hmp_taxa_subset_melt_pivot.head()
pivot() method has no knowledge of your column order¶As we can see from above, we did manage to spread our data out, but the columns have been placed in alphabetical order. While that may not be a huge deal for Python when working with the data in later steps, our human brains betray us. We'd much rather have this information in the taxonomic order that we're used to.
To accomplish this we can reorder our DataFrame based on either an explicitly coded or data-generated list. Recall that using the [] indexing operation with a list will return a DataFrame with the columns in the order specified.
# The two following sets of code happen to produce the same output
# Explicit naming of columns to reorder
hmp_taxa_subset_melt_pivot_reordered = hmp_taxa_subset_melt_pivot[['SUPERKINGDOM', 'PHYLUM', 'CLASS',
'ORDER', 'FAMILY', 'GENUS']]
# Take a quick look at the table
hmp_taxa_subset_melt_pivot_reordered.head()
# Using a call to our original data to set up the column order
hmp_taxa_subset_melt_pivot_reordered = hmp_taxa_subset_melt_pivot[...]
# Take a quick look at the table
hmp_taxa_subset_melt_pivot_reordered.head()
# Take a look at the column and index information
hmp_taxa_subset_melt_pivot_reordered.columns
hmp_taxa_subset_melt_pivot_reordered.index
A quick note about the pivot conversion. You can see that the column names are returned in an Index object whose name is "rank_name" and the row indices are returned in an Index object named "OTU". This gives us the output we see when we call on the head() method of our pivoted DataFrame.
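The same behaviour can be seen in a minimal sketch with toy data (the OTU and rank values below are made up for illustration):

```python
import pandas as pd

# Long-format toy data analogous to our melted table
long_df = pd.DataFrame({"OTU": ["OTU_1", "OTU_1", "OTU_2", "OTU_2"],
                        "rank_name": ["PHYLUM", "GENUS", "PHYLUM", "GENUS"],
                        "taxonomic_level": ["Proteobacteria", "Vibrio",
                                            "Firmicutes", "Bacillus"]})

# pivot(): OTU values become the index, rank_name values become column names
wide = long_df.pivot(index="OTU", columns="rank_name", values="taxonomic_level")
```

Note that wide.columns is an Index named "rank_name" and wide.index is an Index named "OTU", and that the new columns come back in alphabetical order (GENUS before PHYLUM).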
hmp_taxa dataset¶Now we have our dataset in the way we wanted it. Let's apply the code to the entire dataset so we can finally join the two datasets.
# 1. Read-in the file
# hmp_taxa = pd.read_csv("data/human_microbiome_project_otu_taxa_table.csv")
hmp_taxa.head()
# 2. Rename column 0 into something meaningful
hmp_taxa_renamed = hmp_taxa.rename(columns={'Unnamed: 0':'rank_name'}) #, inplace=True)
# 3. Remove row 0 (all taxonomic ranks in single "cells")
hmp_taxa_renamed_drop = hmp_taxa_renamed.drop(hmp_taxa_renamed.index[0]) #, inplace=True) # get rid of row 0
# 4. Select the columns that you want to turn into a single column
my_variables_taxa = list(hmp_taxa_renamed_drop.columns[1:]) # all columns except column 0
# 5. Convert dataset into long format using melt
hmp_taxa_renamed_drop_melt = pd.melt(hmp_taxa_renamed_drop, # dataset that you want to reshape
id_vars = 'rank_name', #Column(s) to use as identifier variables
var_name = 'OTU', # Name to use for the 'variable' column
value_vars = my_variables_taxa, # Column(s) to melt
value_name = 'taxonomic_rank')
# 6. Convert the dataset into wide-format to make each taxonomic rank a variable
hmp_taxa_renamed_drop_melt_pivot = hmp_taxa_renamed_drop_melt.pivot(index = 'OTU',
columns = 'rank_name',
values= "taxonomic_rank")
# 7. Reorder columns hierarchically
# hmp_taxa_renamed_drop_melt_pivot_reordered = hmp_taxa_renamed_drop_melt_pivot[['SUPERKINGDOM', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS']]
hmp_taxa_tidy = hmp_taxa_renamed_drop_melt_pivot[['SUPERKINGDOM', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS']]
# 8. Get rid of the OTU word in the index name
# hmp_taxa_tidy.index.names = ['rank_name'] # index.names takes a list
hmp_taxa_tidy.head()
hmp_taxa_tidy.info()
Notice that the dataset does not have 999 unique OTUs. We have 999 rows because some OTUs have the same taxonomic assignment (this is explained by how OTUs are created using similarities against a consensus sequence). In fact, we have 85 unique genera.
len(hmp_taxa_tidy["GENUS"]. ...)
# A summary of each taxonomic rank
hmp_taxa_tidy.drop_duplicates().count()
Can you guess why unique() gives us 85 values while drop_duplicates().count() gives us 86 values?
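One way to see the difference is with a toy example (the names below are hypothetical, and a genus is deliberately placed under two families for illustration). unique() counts a missing value as one of the values, count() skips missing values, and drop_duplicates() on a whole DataFrame deduplicates entire rows rather than a single column:

```python
import pandas as pd

# unique() treats the missing value as one of the values; count() ignores it
s = pd.Series(["Bacillus", "Bacillus", None, "Vibrio"])
n_unique = len(s.unique())               # 3: 'Bacillus', 'Vibrio', and None
n_counted = s.drop_duplicates().count()  # 2: count() skips the missing value

# drop_duplicates() on a DataFrame deduplicates whole rows, so the same
# genus survives twice if it appears under two different families
df = pd.DataFrame({"FAMILY": ["Listeriaceae", "Lactobacillaceae"],  # hypothetical
                   "GENUS": ["Brochothrix", "Brochothrix"]})
per_column = df.drop_duplicates()["GENUS"].count()  # 2 rows remain
```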
hmp_metadata_tidy and hmp_taxa_tidy¶Into the final stretch, now that we have our data in a tidy format, it is time to merge it together. This can also be referred to as joining our data. In order to successfully do so, we need at least one column from each DataFrame that we can use as a reference to combine the data. These columns can also be referred to as keys.
In our case, we will use the unique OTU identifiers as a key to join the two datasets by matching their indices. Programs like pandas do a much better job of merging tables for us than a copy-paste approach in spreadsheet software, where you can mess up your dataset and ultimately negatively affect your results.
There are several ways to join two datasets with the Pandas package:
- join(): a DataFrame method to join multiple DataFrame objects by index at once
- concat(): use various methods and options to concatenate multiple DataFrame objects
- merge(): combine two DataFrames into a single DataFrame by describing relevant keys
# hmp_metadata.join?
# pd.concat?
# pd.merge?
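As a quick taste of concat() before we focus on merge(), here is a minimal sketch with toy DataFrames (the column name x is arbitrary):

```python
import pandas as pd

# concat() stacks DataFrames along an axis (rows by default)
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})
stacked = pd.concat([a, b], ignore_index=True)  # x = 1, 2, 3
```

ignore_index=True rebuilds the row index so we don't end up with duplicate index labels from the original DataFrames.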
isin() method¶Before we proceed with our example, we will subset the datasets in such a way that they both have the same OTUs so we can get them to match; the same 10 OTUs in each dataset will be enough. The first task is to select 10 OTUs from either file and then use those 10 OTUs to subset both DataFrames.
The first step of choosing some OTUs from hmp_taxa_tidy is easy. To check for entries that exist in hmp_metadata_tidy we need to compare the correct column of information, and see if each element of that column is contained within our set of OTU values. The isin() method will return a DataFrame of booleans on whether any elements in our DataFrame object are contained within the values we supply.
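A minimal sketch of isin() on toy data (the OTU names and counts are made up):

```python
import pandas as pd

df = pd.DataFrame({"OTU": ["OTU_1", "OTU_2", "OTU_3"],
                   "count": [5, 0, 12]})

# isin() returns a boolean Series marking membership in the supplied values
mask = df["OTU"].isin(["OTU_1", "OTU_3"])

# Using the boolean mask with [] keeps only the matching rows
subset = df[mask]
```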
# Step 1. select 10 OTUs from hmp_taxa_tidy
ten_otus = hmp_taxa_tidy.index[...].tolist()
ten_otus # this is a list of 10 OTUs
# What is the shape of our dataset?
hmp_metadata_tidy.shape
# What does isin() return? That depends on what our [] call returns!
type(hmp_metadata_tidy["OTU"]. ...)
# Step 2. check for membership within hmp_metadata_tidy
# In metadata, OTUs are a variable (a column)
hmp_metadata_10_otus = hmp_metadata_tidy[...]
# Look at the subset
hmp_metadata_10_otus.head()
# In taxa, OTUs are indices
hmp_taxa_10_otus = hmp_taxa_tidy[hmp_taxa_tidy ...]
# Look at the subset
hmp_taxa_10_otus.head(10)
# Sanity check our keys now before proceeding
# OTUs in hmp_metadata_10_otus
hmp_metadata_10_otus["OTU"]. ...
print()
print("Unique OTUs in hmp_metadata_10_otus: " + str(len(hmp_metadata_10_otus["OTU"].unique())))
# OTUs in hmp_taxa_10_otus
hmp_taxa_10_otus.index.unique()
print()
print("Unique OTUs in hmp_taxa_10_otus: " + str(len(hmp_taxa_10_otus.index.unique())))
# Directly compare the two groups
hmp_metadata_10_otus["OTU"].unique() ... hmp_taxa_10_otus.index.unique()
Both datasets have the same OTUs.
merge() function to combine two DataFrames¶Let's take a closer look at the merge() function, which has some of the following parameters:
- left: a DataFrame you wish to merge into.
- right: a DataFrame or a named Series that you'll be merging with.
- how: the type of merge you want to perform. Options are left, right, outer, and inner (default).
- left_on: the column or index level names to join on in the left DataFrame.
- right_on: the column or index level names to join on in the right DataFrame.
- left_index: use the index from the left DataFrame as the join key (True)
- right_index: use the index from the right DataFrame as the join key (True)
Let's combine our two subsets, hmp_metadata_10_otus and hmp_taxa_10_otus, now that we've ensured they share the same keys. Note that this isn't strictly necessary: each join type in the how parameter deals with merging observations based on keys in a different manner.
| Join type | Description |
|---|---|
| left | Keep everything from the left DataFrame and only merge intersecting keys from the right DataFrame. Missing keys are filled with NaNs |
| right | Keep everything from the right DataFrame and only merge intersecting keys from the left DataFrame. Missing keys are filled with NaNs |
| outer | AKA a full outer join, creates the union of keys and merges based on these. Unmatched cases are filled with NaNs |
| inner | Creates the intersection of the keys from left and right and merges based on these |
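The table above can be sketched with two toy DataFrames (the keys a-d and columns x, y are arbitrary): keys b and c intersect, while a exists only on the left and d only on the right.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [10, 20, 30]})

# inner keeps only intersecting keys; outer keeps the union, filling NaNs
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")
```

The inner merge yields 2 rows (b and c), while the outer merge yields 4 rows with NaNs wherever a key was missing on one side.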
The next step is to join the two datasets by a common column or index. If we decide to use a column, the column that we use as a key must have the same name in both datasets. If we decide to go with the indices as keys for the merge, we just set merge()'s arguments left_index and right_index to True.
In summary, we either set the hmp_metadata_tidy OTU column as index using the set_index() method OR make hmp_taxa_tidy's index into a column called OTU.
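Both options from the summary above can be sketched on toy data (all names below are made up for illustration):

```python
import pandas as pd

meta = pd.DataFrame({"OTU": ["OTU_1", "OTU_2"], "count": [5, 12]})

# Option A: promote the OTU column to be the index
meta_indexed = meta.set_index("OTU")

# Option B: demote an OTU index back into a regular column
taxa = pd.DataFrame({"GENUS": ["Vibrio", "Bacillus"]},
                    index=pd.Index(["OTU_1", "OTU_2"], name="OTU"))
taxa_as_column = taxa.reset_index()
```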
Let's merge the two datasets by their index.
# Set OTU as index for metadata
hmp_metadata_10_otus_withIndex = hmp_metadata_10_otus ...
hmp_metadata_10_otus_withIndex.head()
# Now we have two datasets with matching indices. Let's proceed to merge them
ten_otu_merged = pd.merge(hmp_taxa_10_otus,
hmp_metadata_10_otus_withIndex,
how = ..., # This is the default value for `how`
left_index=..., right_index=...)
# Check out some merged output
ten_otu_merged.head()
# How big is the merged set?
ten_otu_merged.shape
As simple as that! Now we have a single dataset so we achieved our last goal. Time to apply the code to the full datasets.
Disclaimer: this code will not run on your computer with the large datasets (there is no problem with the alternative subsets); it will crash. Merging the two files results in a file that is 20 GB in size and has 12 million rows. If we started with two files, one 251.2 MB and the other 8.4 MB, how is it possible that we end up with a 20 GB plain text file? Ultimately it is the same amount of data... we just reshaped it, and the taxonomy values are now repeated on every matching metadata row.
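The blow-up comes from row duplication during the merge: each taxonomy row is copied onto every metadata observation that shares its OTU. A minimal sketch with toy data (all names hypothetical):

```python
import pandas as pd

# One taxonomy row per OTU...
taxa = pd.DataFrame({"GENUS": ["Vibrio"]},
                    index=pd.Index(["OTU_1"], name="OTU"))

# ...but many metadata observations share that OTU
meta = pd.DataFrame({"count": [3, 7, 0]},
                    index=pd.Index(["OTU_1", "OTU_1", "OTU_1"], name="OTU"))

# The single taxonomy row is copied onto every matching metadata row
merged = pd.merge(taxa, meta, left_index=True, right_index=True)
```

With ~12 million metadata rows, repeating six taxonomy strings on each row adds up fast.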
# The tidied taxa dataset
hmp_taxa_tidy.head()
hmp_taxa_tidy.info()
# the full metadata file with OTU added as index. Make a copy of the dataset so we don't modify the original one
hmp_metadata_fullWithOTUIndex = hmp_metadata_tidy.copy().set_index('OTU')
hmp_metadata_fullWithOTUIndex.head()
hmp_metadata_fullWithOTUIndex.info()
# Merge the two datasets
taxa_metadata_merged = pd.merge(hmp_taxa_tidy,
hmp_metadata_fullWithOTUIndex,
left_index=True, right_index=True)
taxa_metadata_merged.head()
taxa_metadata_merged.info()
# This is the code I used to subset the 20 Gb file called taxa_metadata_merged
# taxa_metadata_merged.sample(frac= 0.0001, replace = False, weights=None, random_state=1, axis=0) # This file will be used for the lecture
Hopefully you got the hang of what data wrangling (preprocessing) is about and why it often represents 80 to 90% of your data analysis time. The next time you start planning your experiment, it is advisable to also plan ahead how to record your data so you do not have to spend too long formatting it.
As a bonus, if you are mining data or repeating experiments and always getting the same kind of data format, then once you have built these kinds of scripts you can reuse them!
to_csv() method¶The Pandas DataFrame class has a built-in method for exporting our data to various file formats like the comma- or tab-separated value formats. We can actually just use the to_csv() method for both file types because we can alter some of the parameters during the call:
- path_or_buf: a file path or object to write to. In this case just stick with a path.
- sep: a string of length 1 denoting how you want to delimit your values, i.e. ",", ";", "\t".
- index: a boolean to write out the row names as a column (default is True).
- header: write out the column names (default is True). You can also pass a list of strings to overwrite the column names.
# ?pd.DataFrame.to_csv
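A handy detail worth knowing: if you omit path_or_buf, to_csv() returns the delimited text as a string instead of writing a file, which makes it easy to inspect the output. A minimal sketch with toy data (names made up):

```python
import pandas as pd

df = pd.DataFrame({"OTU": ["OTU_1", "OTU_2"], "count": [5, 12]})

# Omitting path_or_buf makes to_csv() return the text instead of writing a file
csv_text = df.to_csv(index=False)
tsv_text = df.to_csv(sep="\t", index=False)
```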
# We'll use a small dataset to write to files
# Write an actual csv file
ten_otu_merged.to_csv('data_writing_test.csv', sep = ... , index = False)
# \t for tab-separated values. Notice that the function name is still to_csv
ten_otu_merged.to_csv('data_writing_test.tsv', sep = ... , index= True)
print('Done writing files!')
to_excel() method¶Similar in idea to the to_csv() method, this will allow us to directly write our DataFrame objects as Microsoft Excel files. We just need to set the proper parameters:
- excel_writer: the file path where we want to send our data.
- na_rep: how we want to present missing data ('', empty quotes are the default).
- index: a boolean to write out the row names as a column (default is True).
- header: write out the column names (default is True). You can also pass a list of strings to overwrite the column names.
# ?pd.DataFrame.to_excel
# We'll just use some small datasets to write to an excel file
ten_otu_merged. ...('data_writing_test.xlsx', index=True)
print('Done writing excel files!')
Now we are going to do some exploratory data analysis (EDA) on subset_taxa_metadata_merged.csv, a subset of the merged dataset (0.001%). It is always good practice to explore your data to find out more about aspects such as probability distributions, outliers, and measures of central tendency and dispersion.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# plt.hist?
# Read in subset_taxa_metadata_merged.csv
merged_subset = pd.read_csv(..., na_values=('')) # fill empty cells with NaN -default behavior
# Take a peek at the data
merged_subset.head()
# How big is this dataset?
merged_subset.info()
So we've imported a dataset of ~12,500 rows across 16 columns. You'll note a new column that we didn't have in our prior dataset called "OTU". Previously we had left this as an index when we were completing our merges, but now it exists as its own column. Recall that at the time of importing we could have used the index_col parameter to import OTU as we had it before.
Time to pull up a summary of the numeric information. Remember that we can use the describe() method to accomplish this.
# Get a description of the numeric data
merged_subset. ...
NaN values¶Next, some histograms to check the count distribution of our variables. Let's start by examining the count column for values that are not NaN. We'll use the notna() method to return a same-shaped boolean Series.
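A minimal sketch of notna() on a toy Series (the values are arbitrary):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])  # None becomes NaN in a float Series

# notna() flags which values are present
present = s.notna()        # True, False, True

# Using the boolean mask drops the NaN
non_missing = s[present]
```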
# How many non-NaN obervations in count?
len(merged_subset[merged_subset['count'].notna()]...)
# The same result with a slightly different approach. Which one is better?
merged_subset[merged_subset['count'].notna()].shape[0]
matplotlib.pyplot.hist() function to make a histogram¶We can make a pretty straightforward histogram from our data using the hist() function from the matplotlib package. We'll use all of the default parameters and simply give it an array (in this case a Series object) of numbers to work with. There are a number of parameters you can set, including how it bins the data.
# Make a histogram for those non-NaN counts
...(merged_subset[merged_subset["count"].notna()]["count"])
# How many of those counts are 0?
len(merged_subset[merged_subset['count'] ... 0]['count'])
# How many are not 0?
len(merged_subset[merged_subset['count'] ... 0]['count'])
cut() function to convert your Series into binned data¶Much like the histogram above, we can create our own binned data using the cut() function, which will accept a Series or 1-D array as input. Some helpful parameters are:
- x: that's just our data
- bins: means by which we will bin our data; this can be a single integer (total bins), a sequence of scalars (bin edges; values outside the outermost edges are excluded), or an IntervalIndex (non-overlapping bin information).
- labels: the labels for the returned bins
- include_lowest: should the first interval be left-inclusive (False by default).
We can convert the returned output into a new column for our data, to further categorize it, and then use the DataFrame hist() method to make a histogram. This actually calls on the same matplotlib hist() function we just used.
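A minimal sketch of pd.cut() with toy counts (the bin edges and labels below mirror the style we use later, but the values are arbitrary):

```python
import pandas as pd

counts = pd.Series([5, 90, 200, 400])

# Left-exclusive bins: (0, 80], (80, 150], (150, 350], (350, 550]
binned = pd.cut(counts, bins=[0, 80, 150, 350, 550],
                labels=[80, 150, 350, 550])
```

Each value lands in the bin whose right edge it does not exceed, so 5 is labelled 80, 90 is labelled 150, and so on.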
# pd.cut()
merged_subset['OTU_Abundance'] = pd.cut(merged_subset[merged_subset['count'].notna()]['count'],
bins = [...], # these values are arbitrary
labels=[80, 150, 350, 550]) # Categories: 80 or less; 81 to 150; 151 to 350; 351 to <550
# Use the hist() method on just one column
merged_subset['OTU_Abundance'].hist(figsize=[7,4])
Notice that above, because our call to cut() is left-exclusive, we drop all of the 0-value data from our count column. Compare the two histograms we've created. When we look at the dataset with our new binning, the microbial (OTU) counts are mostly around the 100 mark.
Is this an accurate representation of the data distribution? Perhaps another kind of visualization would be appropriate to the message we are trying to convey! Something to discuss in another class altogether...
sort_values() method to help order your data¶How do we select for values within our merged_subset DataFrame? Let's play around with our data by identifying genera with counts greater than 0 where genus is not NaN. We'll also sort the values in descending order.
To sort values within a DataFrame we use the sort_values() method whose parameters include:
- by: either a single string (column) or a list of multiple columns.
- axis: use 0 to sort by your index versus 1 to sort by columns.
- ascending: a boolean or list of booleans depending on your by argument.
# Look at the merged_subset again
merged_subset. ...
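Before applying it to the real data, a minimal sort_values() sketch on a toy DataFrame (names made up):

```python
import pandas as pd

df = pd.DataFrame({"GENUS": ["Vibrio", "Bacillus", "Listeria"],
                   "count": [3, 12, 7]})

# Sort rows by the count column, largest first
df_sorted = df.sort_values(by="count", ascending=False)
```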
( # Use parenthesis to clean up your method chaining so it's easier to read
merged_subset.loc[(merged_subset["GENUS"].notna()) ... # conditional #1
(merged_subset["count"] > 0), # conditional #2 for row selection
["OTU", "FAMILY", "GENUS", "count"] # column selection for GENUS and count columns
]
...(by="count", ascending=False) # Sort the values in descending order
)
groupby() method¶Look at our output. For the most part it looks correct, as we now have far fewer rows, BUT a closer inspection shows us an error. We didn't achieve a true count of each genus, as the OTU classifications got in the way. What we want to do is group the rows by one or more columns so we can summarize the count values for each grouping.
To accomplish this we'll use the groupby() method which can group our DataFrame with some of the following parameters:
- by: a label or list of labels relating to the index or columns. You can also pass some more complex structures but we won't get into that here.
- axis: use 0 to sort by your index versus 1 to sort by columns.
- level: if your axis is a MultiIndex (that's right, you can have what amount to multiple columns for indices) you can group by particular levels
We'll also use the sum() method in our chaining, but it will be on a pandas.core.groupby.GroupBy object. If you look this up, you'll see that by default it calculates sums by group in each numeric column.
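The groupby-then-sum pattern can be sketched on toy data (names made up): rows sharing a GENUS are collected into one group, and sum() totals each numeric column per group.

```python
import pandas as pd

df = pd.DataFrame({"GENUS": ["Vibrio", "Vibrio", "Bacillus"],
                   "count": [3, 4, 12]})

# groupby() collects rows by GENUS; sum() then totals the count per group
sums = df.groupby("GENUS").sum()
```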
genera_non_0_andabove = ( # Use parenthesis to clean up your method chaining so it's easier to read
merged_subset.loc[(merged_subset["GENUS"].notna()) & # conditional #1
(merged_subset["count"] > 0), # conditional #2 for row selection
["OTU", "FAMILY", "GENUS", "count"] # column selection for GENUS and count columns
]
...(["FAMILY", "GENUS"]) # group by FAMILY and GENUS
.sum() # Get the sum of each group
.sort_values(by="count", ascending=False) # Sort the values in descending order
.reset_index()
)
# Take a look at the data
genera_non_0_andabove.head()
# What are the dimensions?
genera_non_0_andabove.info()
Let's turn genera_non_0_andabove into a barplot
genera_non_0_andabove.plot(kind=..., x = ..., y=..., alpha= 0.7, label = 'Non-zero genera',
figsize=(7, 4), cmap=plt.get_cmap('jet'))
plt.show()
That's our second class on Python! You've made it through, and we've learned about working with data files:
.xlsx and .csv file formats.
Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 of Data Manipulation with pandas (Transforming Data, 950 possible points; Aggregating Data, 1300 possible points; and Slicing and Indexing, 1150 possible points). For optional practice, do chapter 4! It covers more on exploratory data analysis and a little bit of data visualization. This is a pass-fail assignment, and in order to pass you need to achieve at least 2550 points (75%) of the total possible points. Note that when you take hints from a DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this where it shows the total points earned in each chapter:
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 on Thursday, July 15th, just before the start of that week's lecture, to submit your assignment.
Pandas cheatsheets:
Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
openpyxl package to open excel files¶In earlier years the xlrd package could be used to read Excel spreadsheet files in the .xls format. However, in the past decade Microsoft has focused on the .xlsx format, which xlrd also supported until the package officially dropped support for anything but .xls as of December 2020.
Now the officially recommended package for opening the more complex Excel files is openpyxl. You can also use this package to make and save Excel spreadsheet files, so it's a good one to keep in your toolbox.
Let's start by importing the file miscellaneous.xlsx from the data directory by calling on the load_workbook() function.
#!pip install openpyxl
import openpyxl as opxl
# Open the correct file since we are in the data directory already
wb = opxl.load_workbook('miscellaneous.xlsx')
# What kind of object have we created?
type(wb)
.sheetnames property¶Now that we've generated this Workbook object, we can look at some of its properties, like the names of the various sheets using the .sheetnames property. Remember this will be holding data in a complex hierarchy where we'll have to break down the information into something like a DataFrame object.
From there we can access individual sheets using the ["sheetname"] indexing operators.
# What are the sheetnames of our excel workbook
wb.sheetnames
microbes_sheet = wb["microbes"]
type(microbes_sheet)
.values property¶Now that we have pulled out a Worksheet object, we have to imagine that there are values there for us to use. The way to access this information on the various cells is with the .values property. We'll see, however, that it's not exactly a data structure like we imagine it.
# Try to get a copy of the worksheet data
microbes_data = wb["microbes"].values
# Convert the data using Pandas!
pd.DataFrame(microbes_data)
# What kind of object is returned when we call on the values of the sheet
print(type(microbes_data))
next() command¶As you can see from above, we managed to convert our worksheet data into a DataFrame, but it wasn't exactly a perfect conversion. Looking at the original object we get from the .values property, we get a generator object in return.
A generator is a special function that can make its way through an iterable object (like a list or dictionary) and return a sequence of values, not just a single value. It will also pause its execution wherever it was in the iterable object and return that state to you in another generator object. Sounds confusing, right?
Simply put, this kind of function serves as an iterator for moving through the rows of our sheet value data one at a time using the next() command. This is helpful to know because looking closer at our DataFrame we see that the header information was deposited into the first row of our DataFrame rather than as column names. There are a few ways to remedy this but one way is to properly iterate through our generator with the next() command.
By default the next() command of our Worksheet.values generator will return a row from our values. Once it gets to the end of the structure, calling next() will raise a StopIteration error.
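The behaviour is easy to reproduce with a tiny hand-written generator (the row values below are made up to mimic what Worksheet.values yields):

```python
# A minimal generator yielding "rows", much like Worksheet.values does
def row_gen():
    yield ("OTU", "count")
    yield ("OTU_1", 5)

rows = row_gen()
first = next(rows)   # the first row: column names
second = next(rows)  # the second row: data values
# A third next(rows) would raise StopIteration - the generator is exhausted
```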
Let's take a look ourselves.
# Go to the next line in our values
next(microbes_data)
Why did we get the StopIteration? When we assigned it to a DataFrame above, pandas iterated through the whole .values generator, and now there's nothing else to squeeze from it. We'll have to start over again.
# Re-assign the values data.
microbes_data = wb["microbes"].values
# Iterate through some of those rows
next(microbes_data)
# What type of object is this returning
type(next(microbes_data))
next() function returns a tuple¶That's right, our good friend the tuple is back and therefore we can slice it with the [] operators. That means we can pull out specific "columns" that we're interested in from each row.
# Get an entire row
next(microbes_data)[:]
# This grabs the entire row too
next(microbes_data)[0:]
# Slice the next row a little
next(microbes_data)[-5:-1]
next() to extract a header for a DataFrame¶Now that we know the inner workings a little more, we can grab the first row from the Worksheet.values generator and keep that for our columns. Then we can pass the rest of the generator to the initialization of the DataFrame. Let's see that in detail.
# Re-assign the values data.
microbes_data = wb["microbes"].values
# Grab the column names from the first "tuple" in our generator
colNames = next(microbes_data)[0:]
# Now we send the rest of the generator to be made into a DataFrame
microbes_df = pd.DataFrame(microbes_data, columns = colNames)
# Let's view some of the DataFrame
microbes_df.head()
microbes_df.shape
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.